45 research outputs found

    On Optimal Behavior Under Uncertainty in Humans and Robots

    Despite significant progress in robotics and automation in recent decades, there still remains a noticeable gap in performance compared to humans. Although computational capabilities are growing every year, and are even projected to exceed the capacities of biological systems, the behaviors generated using current computational paradigms are arguably not catching up with the available resources. Why is that? It appears that we are still lacking some fundamental understanding of how living organisms make decisions, and therefore we are unable to replicate intelligent behavior in artificial systems. In this thesis, we therefore attempted to develop a framework for modeling human and robot behavior based on statistical decision theory. Different features of this approach, such as risk sensitivity, exploration, learning, and control, were investigated in a number of publications. First, we considered the problem of learning new skills and developed a framework of entropic regularization of Markov decision processes (MDPs). Utilizing a generalized concept of entropy, we were able to realize the trade-off between exploration and exploitation via the choice of a single scalar parameter determining the divergence function. Second, building on the theory of partially observable Markov decision processes (POMDPs), we proposed and validated a model of human ball-catching behavior. Crucially, information-seeking behavior was identified as a key feature enabling the modeling of observed human catches; thus, entropy reduction was seen to play an important role in skillful human behavior. Third, having extracted the modeling principles from human behavior and having developed an information-theoretic framework for reinforcement learning, we studied real-robot applications of learning-based controllers in tactile-rich manipulation tasks. We investigated vision-based tactile sensors and the capability of learning algorithms to autonomously extract task-relevant features for manipulation tasks. The specific feature of tactile-based control, namely that perception and action are tightly connected at the point of contact, enabled us to gather insights into the strengths and limitations of the statistical learning approach to real-time robotic manipulation. In conclusion, this thesis presents a series of investigations into the applicability of the statistical decision theory paradigm to modeling the behavior of humans and synthesizing the behavior of robots. We conclude that a number of important features related to information processing can be represented and utilized in artificial systems for generating more intelligent behaviors. Nevertheless, these are only first steps, and we acknowledge that the road towards artificial general intelligence and skillful robotic applications will require further innovations and potentially a transcendence of the probabilistic modeling paradigm.
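
    As a hedged illustration of the first contribution (entropic regularization of MDPs), the objective can be sketched, with notation assumed here rather than taken from the thesis, as

```latex
% Illustrative divergence-regularized objective (assumed notation):
% a single scalar \alpha selects the divergence function, and \beta trades off
% reward maximization against staying close to a reference policy \pi_0.
\max_{\pi}\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_t r(s_t, a_t)\Big]
  \;-\; \frac{1}{\beta}\, D_{\alpha}\!\big(\pi \,\big\|\, \pi_0\big)
```

    where D_α denotes a member of the α-divergence family; the choice of the scalar α then determines the divergence function and thereby the exploration–exploitation trade-off, as stated in the abstract.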

    f-Divergence constrained policy improvement

    To ensure stability of learning, state-of-the-art generalized policy iteration algorithms augment the policy improvement step with a trust-region constraint bounding the information loss. The size of the trust region is commonly determined by the Kullback-Leibler (KL) divergence, which not only captures the notion of distance well but also yields closed-form solutions. In this paper, we consider a more general class of f-divergences and derive the corresponding policy update rules. The generic solution is expressed through the derivative of the convex conjugate function to f and includes the KL solution as a special case. Within the class of f-divergences, we further focus on a one-parameter family of α-divergences to study the effects of the choice of divergence on policy improvement. Previously known as well as new policy updates emerge for different values of α. We show that every type of policy update comes with a compatible policy evaluation resulting from the chosen f-divergence. Interestingly, mean-squared Bellman error minimization is closely related to policy evaluation with the Pearson χ²-divergence penalty, while the KL divergence results in the soft-max policy update and a log-sum-exp critic. We carry out an asymptotic analysis of the solutions for different values of α and demonstrate the effects of using different divergence functions on a multi-armed bandit problem and on standard reinforcement learning problems.
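
    As a concrete illustration of the KL special case described in this abstract, the following minimal numpy sketch (a single discrete-action state, written in the penalized rather than the constrained form; the temperature `eta`, the uniform prior, and the toy action values are assumptions made for the example, not details from the paper) computes the soft-max policy update and the accompanying log-sum-exp value:

```python
# A minimal numpy sketch of the KL special case described above, in its
# penalized (Lagrangian) form for a single discrete-action state.
# The temperature `eta`, the uniform prior, and the toy action values are
# assumptions made for this example, not details taken from the paper.
import numpy as np

def kl_penalized_update(prior_probs, q_values, eta=1.0):
    """Soft-max policy update and log-sum-exp value under a KL penalty.

    prior_probs : prior policy probabilities q(a)
    q_values    : estimated action values Q(a)
    eta         : penalty strength / temperature (illustrative)
    """
    # Shift by the maximum for numerical stability of the exponentials.
    shifted = (q_values - q_values.max()) / eta
    weights = prior_probs * np.exp(shifted)      # q(a) * exp(Q(a) / eta)
    new_policy = weights / weights.sum()         # soft-max policy update

    # Log-sum-exp "soft" value implied by the same penalty:
    # V = eta * log sum_a q(a) exp(Q(a) / eta).
    soft_value = eta * np.log(weights.sum()) + q_values.max()
    return new_policy, soft_value

# Example usage on a 3-armed bandit with a uniform prior.
prior = np.array([1.0, 1.0, 1.0]) / 3.0
q = np.array([1.0, 0.5, -0.2])
policy, value = kl_penalized_update(prior, q, eta=0.5)
```

    With a uniform prior this reduces to an ordinary soft-max over Q(a)/eta, and as eta decreases the update approaches the greedy policy while the value approaches max_a Q(a).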

    Entropic Risk Measure in Policy Search

    With the increasing pace of automation, modern robotic systems need to act in stochastic, non-stationary, partially observable environments. A range of algorithms for finding parameterized policies that optimize for long-term average performance have been proposed in the past. However, the majority of the proposed approaches do not explicitly take into account the variability of the performance metric, which may lead to finding policies that, although performing well on average, can perform spectacularly badly in a particular run or over a period of time. To address this shortcoming, we study an approach to policy optimization that explicitly takes into account higher-order statistics of the reward function. In this paper, we extend policy gradient methods to include the entropic risk measure in the objective function and evaluate their performance in simulation experiments and on a real-robot task of learning a hitting motion in robot badminton.
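
    For illustration, here is a minimal sketch of how the entropic risk measure J(theta) = (1/beta) * log E[exp(beta * R)] can enter a REINFORCE-style gradient estimate; the 1-D Gaussian policy, the toy reward, beta, and the step size are assumptions for this example rather than the paper's setup:

```python
# An illustrative sketch (not the paper's implementation) of a REINFORCE-style
# gradient estimate for the entropic risk objective
#     J(theta) = (1 / beta) * log E[ exp(beta * R) ],
# using exponentially re-weighted returns. The 1-D Gaussian policy, the toy
# quadratic reward, beta, and the step size are assumptions for this example.
import numpy as np

rng = np.random.default_rng(0)

def entropic_risk_gradient(theta, beta=0.5, sigma=0.3, n_samples=1000):
    """One gradient estimate for a Gaussian policy a ~ N(theta, sigma^2)."""
    actions = theta + sigma * rng.normal(size=n_samples)
    returns = -(actions - 1.0) ** 2              # toy reward, maximized at a = 1
    grad_log_pi = (actions - theta) / sigma**2   # d/dtheta of log N(a; theta, sigma^2)

    # Self-normalized weights exp(beta * R_i) / sum_j exp(beta * R_j),
    # computed with a max-shift for numerical stability.
    w = np.exp(beta * (returns - returns.max()))
    w /= w.sum()

    # Gradient of (1/beta) * log E[exp(beta * R)] via the log-derivative trick.
    return (1.0 / beta) * np.sum(w * grad_log_pi)

# Simple gradient ascent on the risk-sensitive objective.
theta = 0.0
for _ in range(200):
    theta += 0.05 * entropic_risk_gradient(theta)
```

    The sign of beta controls the risk attitude (risk-seeking for beta > 0, risk-averse for beta < 0), and in the limit beta → 0 the objective and its gradient recover the ordinary risk-neutral policy gradient.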